Language model
A statistical language model assigns a probability P(w_1,\ldots,w_m) to a sequence of m words by means of a probability distribution. Estimating the relative likelihood of different phrases is useful in many natural language processing applications: language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, information retrieval and other applications.

TODO: combining corpus and knowledge bases: http://aclweb.org/anthology/N/N15/N15-1165.pdf
TODO: https://arxiv.org/pdf/1611.01628.pdf

Types

Based on informational content

* Cache-based LMs [Roland Kuhn and Renato de Mori. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570–583, 1990] use a cache window to store statistical temporal information. They are motivated by the observation that humans tend to use language in bursts: a word that occurred in the recent history has a higher chance of occurring again in the near future.
* Class-based LMs [Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479, 1992] and topic-based LMs [Daniel Gildea and Thomas Hofmann. Topic-based language models using EM. In Proceedings of EUROSPEECH, pages 2167–2170, 1999] exploit the clustering of training data to improve language models.
* Structured LMs [Ciprian Chelba. A structured language model. In Association for Computational Linguistics, pages 498–500, 1997] directly embed the syntactic structure of language into language models.

More about structured LMs: "Many researchers attempt to go beyond the word-based language model and augment the translation system with syntax-based language models.
Charniak, Knight, and Yamada (2003) design a CFG-based syntax language model for translation output reranking. Shen, Xu, and Weischedel (2008) [Shen, L., Xu, J., & Weischedel, R. M. (2008, June). A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model. In ACL (pp. 577-585)] propose a dependency language model for the hierarchical phrase-based system (Chiang 2007). Post and Gildea (2009), Xiao, Zhu, and Zhu (2011) and Zhang, Zhai, and Zong (2013) propose a tree substitution grammar based syntax language model for the string-to-tree translation model. However, these syntax-based language models much increase the decoding time and they are very difficult to be integrated into the phrase-based translation systems which just generate translation outputs phrase by phrase."

Based on information encoding scheme

* N-gram LMs
* Decision tree LMs
** Random forest LMs
* Dynamic Bayesian LMs
** Hidden Markov LMs
* Exponential LMs
* Neural network LMs
** Feed-forward NNLMs
** Recurrent NNLMs
* Log-bilinear LMs [Mnih, Andriy and Geoffrey Hinton (2007). "Three new graphical models for statistical language modelling". In: Proceedings of the 24th international conference on Machine learning. ICML '07. Corvallis, Oregon: ACM, pp. 641–648]
* Converted LMs: a feed-forward NNLM converted to a back-off N-gram LM for faster decoding [Arisoy, E., Chen, S. F., Ramabhadran, B., & Sethy, A. (2014). Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(1), 184-192]

Measure

Cross-entropy

According to Shi (2014) [Language Models with Meta-information]: "To measure the quality of a language model, one method is to estimate the logarithm likelihood LP(W) of test data with n words, which are assumed to be drawn from the true data distribution.
\text{LP}(W) = \frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i)

The negative value of this quantity, i.e., −LP(W), is the cross-entropy. In information theory [91], the cross-entropy H(p, q) of p and q measures how close a probability model q comes to the true model p of some random variable X, which is formulated as:

H(p, q) = - \sum_{x \in X} p(x) \log_2 q(x) ."

Perplexity

According to Shi (2014): "The most commonly used measure for language models is perplexity. The perplexity PL of a language model is calculated as the geometric average of the inverse probability of the words on the test data:

\text{PL} = \left( \prod_{i=1}^{t} P(w_i | h(w_i)) \right)^{-\frac{1}{t}} ,

where h(w_i) = w_1, w_2, \ldots, w_{i-1} [and t is the number of words in the test data, written n above]. Perplexity is highly correlated with cross-entropy. It actually can be seen as exponential of entropy. Note that in most cases, the true model is unknown. Therefore perplexity can be viewed as an empirical estimate of the cross-entropy. Perplexity can be the measure for both the language and models. As the measure for the language, it estimates the complexity of a language. When it is considered as the measure for models, it shows how close the model is to the "true" model represented by the test data. The lower the perplexity, the better the model is. It is important to keep in mind that perplexity is not suitable for measuring language models using un-normalized probabilities. Also perplexity can not be used to compare language models that were constructed on different vocabularies. In these situations, other measures should be chosen."

Word prediction accuracy

According to Shi (2014): "Word prediction has applications in natural language processing, such as augmentative and alternative communication, spelling correction, word and sentence auto completion, etc. Typically word prediction provides one word or a list of words which fit the context best. This function can be realized by language models as a side product.
Looking at this from the other side, word prediction accuracy provides a measure of the performance of language models. Word prediction accuracy is calculated as follows:

\text{WPA} = \frac{C}{N}

where C is the number of words that are correctly predicted. N is the total number of words in the testing. Similar to WER, word prediction accuracy (WPA) is also correlated with perplexity. Intuitively, perplexity can be thought of as the average number of choices a language model has to make. The smaller the number of choices, the higher the word prediction accuracy is. Usually low perplexity co-occurs with a high WPA. However, there are also counterexamples in the literature [159]. Compared with perplexity, WPA has less constraints. It can be applied to measure unnormalized language models. It can also be applied to compare language models constructed from different vocabularies, which happens often in adaptive language models. Compared with the computation of WER, which is speech recognizer dependent, WPA is much easier to calculate. WPA does not have extra dependencies, which makes it suitable to compare language models used in different speech recognizers, i.e. at different research sites."

Word error rate

According to Shi (2014): "In speech recognition, the performance of language models is also assessed by word error rate (WER), which is defined as

\text{WER} = \frac{S + D + I}{N}

where S, D and I are the number of substitutions, deletions and insertions, respectively, when the prediction hypotheses are aligned with the ground truth according to a minimum edit distance [and N is the number of words in the ground truth]. WER is the measure that comes from speech recognition systems. In order to calculate a WER, a complete speech recognizer is needed. Compared with the calculation of perplexity, WER is more expensive. The WER results are noisy, because speech recognition performance also depends on the quality of acoustic models. Usually low perplexity implies low word error rate.
However, this is not always true [Stanley Chen, Douglas Beeferman, and Ronald Rosenfeld. Evaluation Metrics for Language Models. In DARPA Broadcast News Transcription and Understanding Workshop (BNTUW), February 1998; Rukmini Iyer, Mari Ostendorf, and Marie Meteer. Analyzing and predicting language model improvements. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, pages 254–261, 1997]. Ultimately, the quality of language models must be measured by their effect on real applications. When comparing different language models on the same well constructed speech recognition systems, the WER is an informative metric."
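
The four measures described under Measure (cross-entropy, perplexity, word prediction accuracy and word error rate) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the per-word probabilities are assumed to be already produced by some normalized language model, the inputs are assumed non-empty, and WPA is assumed to compare a single top-ranked prediction per position against the reference word.

```python
import math


def cross_entropy(word_probs):
    """Return -LP(W): the negative mean log2 probability of the test words."""
    n = len(word_probs)
    return -sum(math.log2(p) for p in word_probs) / n


def perplexity(word_probs):
    """Geometric mean of the inverse word probabilities, i.e. 2 ** cross-entropy."""
    return 2.0 ** cross_entropy(word_probs)


def word_prediction_accuracy(predicted, reference):
    """WPA = C / N, with C the number of positions predicted correctly."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


def word_error_rate(hypothesis, reference):
    """WER = (S + D + I) / N at the minimum word-level edit distance."""
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical per-word probabilities P(w_i | h(w_i)) on a four-word test text.
probs = [0.25, 0.5, 0.125, 0.25]
print(cross_entropy(probs))  # 2.0 bits per word
print(perplexity(probs))     # 4.0

print(word_prediction_accuracy(["the", "cat", "sat", "down"],
                               ["the", "cat", "sat", "on"]))  # 0.75

# One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
print(word_error_rate("the cat sit on mat".split(),
                      "the cat sat on the mat".split()))
```

The sketch makes the relation between the first two measures explicit: perplexity is computed as 2 raised to the cross-entropy, so a model that assigns each test word a probability of 1/4 on average (in the geometric sense) has a perplexity of 4. The WER function only covers the alignment step; in practice the hypothesis would come from a complete speech recognizer, which is what makes WER more expensive to obtain.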